set.seed(12345)
source("ingest_data.R")
MODEL_CACHE <- "removed-M_intercepts_only"

Settings

Models are stored in the following directory, which must exist prior to knitting this document:

cat(normalizePath(paste(getwd(), dirname(cachefile(MODEL_CACHE)), sep="/"), mustWork = T))
## /home/app/.cache

The used cache directory can be controlled via the cache parameter to Rmd - it can be useful to experiment with this parameter if you Knit the document manually in RStudio.

Intercept-only model (baseline)

The simplest possible model, only intercepts (on population, committerteam and team-in-repo level):

This model assumes that each team-repository combination has a unique intercept, based on the population-level intercept, plus a team-level, and team-per-repo-level offset from that intercept. This model does not take into account any statistics about the actual change - it only assumes “when team T changes in repo R, then the likelihood of introducing duplicates are y”.

We use identical formulas and priors for the zero-inflation part of the equation (as the number of zeros dominate in the data).

This is our formula, including the brms the default priors, which we will change below.

d <- data |> select(y=REMOVED,
                    team=committerteam,
                    repo=repo)
formula <- bf(y ~ 1 + (1 | team) + (1 | team:repo),
              zi ~ 1 + (1 | team) + (1 | team:repo))
get_prior(data=d,
          family=zero_inflated_negbinomial,
          formula=formula)
##                    prior     class      coef     group resp dpar nlpar lb ub
##  student_t(3, -2.3, 2.5) Intercept                                          
##     student_t(3, 0, 2.5)        sd                                      0   
##     student_t(3, 0, 2.5)        sd                team                  0   
##     student_t(3, 0, 2.5)        sd Intercept      team                  0   
##     student_t(3, 0, 2.5)        sd           team:repo                  0   
##     student_t(3, 0, 2.5)        sd Intercept team:repo                  0   
##        gamma(0.01, 0.01)     shape                                      0   
##           logistic(0, 1) Intercept                            zi            
##     student_t(3, 0, 2.5)        sd                            zi        0   
##     student_t(3, 0, 2.5)        sd                team        zi        0   
##     student_t(3, 0, 2.5)        sd Intercept      team        zi        0   
##     student_t(3, 0, 2.5)        sd           team:repo        zi        0   
##     student_t(3, 0, 2.5)        sd Intercept team:repo        zi        0   
##        source
##       default
##       default
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##       default
##       default
##       default
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)

We adjust the priors accordingly:

priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(weibull(2, .25), class = sd),
            prior(normal(0, 0.5), class = Intercept, dpar=zi),
            prior(weibull(2, 0.25), class = sd, dpar=zi),
            prior(gamma(0.5, 0.1), class = shape))

validate_prior(prior=priors,
               formula=formula,
               data=d,
               family=zero_inflated_negbinomial)
##             prior     class      coef     group resp dpar nlpar lb ub
##    normal(0, 0.5) Intercept                                          
##    normal(0, 0.5) Intercept                            zi            
##  weibull(2, 0.25)        sd                                      0   
##  weibull(2, 0.25)        sd                            zi        0   
##  weibull(2, 0.25)        sd                team                  0   
##  weibull(2, 0.25)        sd Intercept      team                  0   
##  weibull(2, 0.25)        sd                team        zi        0   
##  weibull(2, 0.25)        sd Intercept      team        zi        0   
##  weibull(2, 0.25)        sd           team:repo                  0   
##  weibull(2, 0.25)        sd Intercept team:repo                  0   
##  weibull(2, 0.25)        sd           team:repo        zi        0   
##  weibull(2, 0.25)        sd Intercept team:repo        zi        0   
##   gamma(0.5, 0.1)     shape                                      0   
##        source
##          user
##          user
##          user
##          user
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##          user

Using the brms default shape prior (gamma(0.01,0.01)) shows that we get unreasonably many zeros, and the max observed value are also incredibly large (3e7 or more). Because of this, we also have very few observations in the plausible range (1-1000 duplicates). We also get some (1%) divergent transitions. Thus, we need to tighten that prior to some more realistic shape. After some experimenting, we arrive at gamma(0.5, 0.1) as a good compromise.

Prior Predictive Checks

M_ppc <- brm(data = d,
      family = zero_inflated_negbinomial,
      formula = formula,
      prior = priors,
      file = cachefile(paste0("ppc-",MODEL_CACHE)),
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      sample_prior = "only",
      backend="cmdstanr",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
m <- M_ppc
yrep <- posterior_predict(m)

Number of zeros

We expect the number of zeros (changes that do not remove duplicates) to dominate, as this would probably be even rarer than clone introductions.

ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0)) + ggtitle("Prior predicted proportion of zeros")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Our prior for the zero-inflation are not particularly strong - but at least we assume that the overall ratio of zeros should be over half the changes - somewhat lower than the \(\approx 95\)% present in the data.

Max predicted value

While our priors place the maximum likely value towards the lower end of the thousands, they do not rule out even 150 thousand duplicates in a single change (which we think is quite unrealistic). But it is important to allow some very unlikely values in the priors, because parameter space where prior information is zero will never show up in the posterior, even if it is present in the data (following Bayes’ theorem). Thus, even though the priors show some unreasonable values, at least we know that if we ever see such data, our model will allow us to see it.

(sim_max <- ppc_stat(y = d$y, yrep, stat = "max") + ggtitle("Prior predicted max values")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scaling to more reasonable values show that our model actually expects somewhat smaller max values than observed, on average - but we see above that it does encompass also the observed max value, and quite a bit more range.

sim_max + scale_x_continuous(limits = c(0,1000)) + ggtitle("Prior predicted max values up to 1000")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

99th percentile

The 99th percentile is a more stable metric than the maximum value for highly right-skewed data like ours. The priors do a good job of matching the observations.

ppc_stat(y = d$y, yrep, stat = "q99") + ggtitle("Prior predicted Q99 vs. observed value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

99th vs 95th percentile

We can even plot the 95th percentile versus the 99th, just to show how the spread of likely values vary.

(p <- ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + theme(legend.position="bottom") + ggtitle("Prior predicted Q95 vs Q99")
)

Standard deviation

The standard deviation of the predictions is a bit harder to grasp intuitively. But our prior information encompass the observed value well.

(p <- ppc_stat(y = d$y, yrep, stat = "sd") + ggtitle("Prior predicted stddev vs. observed value")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Zooming in to show the distribution in more detail.

p + scale_x_continuous(limits=c(0,30))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Group-level predictions

ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + theme(legend.position = "bottom") + ggtitle("Prior predictive Q99 observation per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + theme(legend.position = "bottom") + ggtitle("Prior predictive Q99 observation per repository")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Model execution

MR_intercepts_only <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = cachefile(MODEL_CACHE),
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter  = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend="cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
MR_intercepts_only <- add_criterion(MR_intercepts_only, criterion = "loo")
m <- MR_intercepts_only

Model diagnostics

p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) {
  start <- i
  end <- start+11
  mcmc_trace(m, pars = na.omit(pars[start:end]))
  })
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

## 
## [[11]]

## 
## [[12]]

## 
## [[13]]

## 
## [[14]]

## 
## [[15]]

## 
## [[16]]

## 
## [[17]]

The “caterpillar plots” show that the chains mixed well, and explored the available parameter space.

mcmc_plot(m, type="rhat")

mcmc_plot(m, type="rhat_hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mcmc_plot(m, type="neff")

mcmc_plot(m, type="neff_hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Both MCMC chains, Rhat and Neff ratio looks good.

loo <- loo(m)
loo
## 
## Computed from 12000 by 31007 log-likelihood matrix.
## 
##          Estimate    SE
## elpd_loo  -7535.6 172.8
## p_loo        89.9   7.0
## looic     15071.2 345.7
## ------
## MCSE of elpd_loo is NA.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.7]).
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. ESS
## (-Inf, 0.7]   (good)     31004 100.0%  1086    
##    (0.7, 1]   (bad)          3   0.0%  <NA>    
##    (1, Inf)   (very bad)     0   0.0%  <NA>    
## See help('pareto-k-diagnostic') for details.
plot(loo)

We have 3 observations with high Pareto k values. These might be highly influential points, and should be investigated by the reloo function. This function will, in turn, exclude each suspicious data point, and refit the model, comparing the result with the original model). Most likely due to sparse data for some observations.

If these calculations produce reasonable Pareto values (< 0.7), then we can still trust the results. Reloo is disabled by default, but could be enabled by changing the eval value of the cell

reloofile <- cachefile(paste0("reloo-", MODEL_CACHE, ".rds"))
if (file.exists(reloofile)) {
    (reloo <- readRDS(reloofile))
} else {
    Sys.time()
    (reloo <- reloo(m, loo, chains=CHAINS, cores=CORES) )
    Sys.time()
    saveRDS(reloo, reloofile)
}
## 
## Computed from 12000 by 31007 log-likelihood matrix.
## 
##          Estimate    SE
## elpd_loo  -7536.3 172.9
## p_loo        90.6   7.4
## looic     15072.6 345.8
## ------
## MCSE of elpd_loo is 0.2.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.7]).
## 
## All Pareto k estimates are good (k < 0.7).
## See help('pareto-k-diagnostic') for details.

Doing the reloo takes several hours, so it is best done manually (e.g over night).

# plotting the recalculated values
plot(reloo)

# which points have higher pareto_k than 0.5?
influencers <- data[loo::pareto_k_ids(reloo),]
influencers |> group_by(repo,committerteam) |> tally()
## # A tibble: 0 × 3
## # Groups:   repo [0]
## # ℹ 3 variables: repo <fct>, committerteam <fct>, n <int>

Posterior predictive checks

yrep <- posterior_predict(m)

Posterior proportion of zeros

ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0)) + ggtitle("Posterior predicted proportion of zeros")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of zeros are spot-on.

Posterior max value

sim_max <- ppc_stat(y = d$y, yrep, stat = "max") + ggtitle("Posterior predicted max value")
sim_max
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The most likely posterior max value is slightly below 100, and the observed value (from IntTest repository) is around 150. But our priors are not too strong, our models place some probability up to a few hundred removed duplicated. This means that we could trust that our model, in case such data should show up, will consider that data as well.

Posterior standard distribution

ppc_stat(y = d$y, yrep, stat = "sd") + ggtitle("Posterior predicted standard deviation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the probability mass are between 1.1nd 2 in standard deviation, slightly below the observed value.

Posterior Q95 vs Q99

ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + ggtitle("Posterior predicted Q95 vs Q99")

The model converges well, and predicts 95th percentile between 0 and 1, and 99th percentile between 3 and 5 (observed values, as indicated above, is 0 and 3, respectively).

Posterior grouped predictions

ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$team) + ggtitle("Posterior predictive max observation per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The max value per team varies between teams, but for most teams the observed value fall reasonably well within the predictions. Remember, this model does not take any numerical predictor (such as size of the change) into account, so different team behaviour due to this will not be visible.

ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$repo) + ggtitle("Posterior predictive max observation per repo")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Posterior max per repository have some variation, in particular in Saturn, but still falls within the plausible range.

ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + ggtitle("Posterior predictive 99% quantile per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The 99th quantile are more stable metrics than the max value. The UI team shows that they are more reliant on priors (have a much wider \(x\) axis range in their predictions)—this is because they have supplied fewer data points than the other teams.

ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + ggtitle("Posterior predictive 99% quantile per repo")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Similar conclusions for the 99th quantile per repository.

Rootogram

rootogram <- pp_check(m, type = "rootogram", style="suspended")
## Using all posterior draws for ppc type 'rootogram' by default.
rootogram

Sizing the rootgram according to reasonable (observed) values.

p <- rootogram + scale_x_continuous(limits=c(0,50)) + scale_y_continuous(limits=c(-3,30)) + ggtitle("Suspended rootogram, y=0 removed, x scaled to 0-50 prediction interval")
p
## Warning: Removed 478 rows containing missing values or values outside the scale range
## (`geom_tile()`).
## Warning: Removed 477 rows containing missing values or values outside the scale range
## (`geom_smooth()`).

source("predict_utils.R")
# the predict functions use predictors, so we have to supply them. But this model do not consider these values, only the team and the repository, and the intercept
heatmap_by_team_and_repo(posterior_predict_by_team_and_repo(m, added=q95(data$ADD), removed=q95(data$DEL), complexity=q95(data$COMPLEX), duplicates = q95(data$DUP), summary=function(x) q99(x)), "Quantile 99%", decimals=0) + ggtitle("Quantile 99% per team and repo")

Comparing the with the observed data, we see that the predicted 99th percentile is a good fit, though slightly higher (the observed data contains more 0). But the trend is clear—Integration tests dominate also the removals, with Yellow, Brown and Orange leading the way, and Pink, UI and Unknown lagging behind. In Neptune, the Architects and Green team lead, with Blue and Brown following up.

removals_probability(onepred(model=MR_intercepts_only))
## `summarise()` has grouped output by 'added', 'removed', 'complexity',
## 'duplicates', 'team', 'repo'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'added', 'removed', 'complexity',
## 'duplicates', 'team'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'added', 'removed', 'complexity',
## 'duplicates', 'team', 'repo'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'added', 'removed', 'complexity',
## 'duplicates', 'team'. You can override using the `.groups` argument.

Apart from the Integration Test, and to some extent Neptune, this model does not really make much difference between teams in the various repositories. This is also illustrated in the heatmap above.

Conclusion

Our model converges well, but is limited in its predictive capabilities. It will be used as a baseline model.